Extracting Chinese Multi-Word Units from Large-Scale Balanced Corpus
Abstract
Automatic multi-word unit extraction is an important issue in Natural Language Processing. This paper proposes a new statistical method, based on a large-scale balanced corpus, for extracting multi-word units. We use two improved traditional measures, mutual information and the log-likelihood ratio, and raise the precision of the top 10,000 extracted units to 80.13%. The results indicate that this method is more efficient and robust than previous multi-word unit extraction methods.
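The abstract names the two association measures but not the paper's improvements to them, so the sketch below implements only the classical versions: pointwise mutual information and Dunning's log-likelihood ratio over adjacent-token bigrams. The function names, the `min_freq` cutoff, and the ranking-by-LLR step are illustrative assumptions, not the authors' formulas.

```python
import math
from collections import Counter

def pmi(c_xy, c_x, c_y, n):
    # Pointwise mutual information: log2 of the observed co-occurrence
    # probability over the probability expected under independence.
    return math.log2((c_xy * n) / (c_x * c_y))

def llr(c_xy, c_x, c_y, n):
    # Dunning's log-likelihood ratio for a bigram (x, y): compares the
    # independence hypothesis P(y|x) = P(y|not x) against the hypothesis
    # that the two contexts have different probabilities of producing y.
    def ll(k, trials, p):
        # Binomial log-likelihood; zero-probability terms contribute 0.
        if p <= 0.0 or p >= 1.0:
            return 0.0
        return k * math.log(p) + (trials - k) * math.log(1.0 - p)

    p = c_y / n                      # P(y) under independence
    p1 = c_xy / c_x                  # P(y | x)
    p2 = (c_y - c_xy) / (n - c_x)    # P(y | not x)
    return 2.0 * (ll(c_xy, c_x, p1) + ll(c_y - c_xy, n - c_x, p2)
                  - ll(c_xy, c_x, p) - ll(c_y - c_xy, n - c_x, p))

def score_bigrams(tokens, min_freq=5):
    # Count unigrams and adjacent bigrams, drop rare candidates, then
    # rank by LLR; PMI is kept alongside as a secondary association score.
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scored = [
        ((x, y), llr(c, unigrams[x], unigrams[y], n),
         pmi(c, unigrams[x], unigrams[y], n))
        for (x, y), c in bigrams.items() if c >= min_freq
    ]
    return sorted(scored, key=lambda t: t[1], reverse=True)
```

Extending this from bigrams to longer multi-word units would require scoring n-gram candidates against their component parts, which the abstract does not detail.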
Similar Resources
A Corpus-based Machine Translation Method of Term Extraction in LSP Texts
To tackle the problem of term extraction in language-specific fields, this paper proposes a method that coordinates the use of a corpus and a machine translation system for extracting terms from LSP texts. A comparable corpus built for this research contains 167 English texts and 229 Chinese texts, with around 600,000 English tokens and 900,000 Chinese characters. The corpus is annotated with meta-informatio...
Design and Prototype of a Large-Scale and Fully Sense-Tagged Corpus
A sense-tagged corpus plays a crucial role in Natural Language Processing, especially in research on word sense disambiguation and natural language understanding. A large-scale Chinese sense-tagged corpus is essential, yet such a corpus is critically lacking at the current stage. This paper aims to design a large-scale Chinese full-text se...
WEMOTE - Word Embedding based Minority Oversampling Technique for Imbalanced Emotion and Sentiment Classification
Imbalanced training data is a persistent problem for supervised emotion and sentiment classification. Several existing studies have shown that data sparseness and small disjuncts are the two major factors affecting classification. Targeting these two problems, this paper presents a word-embedding-based oversampling method. Firstly, a large-scale text corpus is used to train a continuous ...
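This abstract is cut off before the method is spelled out; one common reading of "word embedding based oversampling" is SMOTE-style interpolation between minority-class samples in a document-embedding space. The sketch below follows that reading only; the function names and interpolation scheme are assumptions, not the paper's WEMOTE algorithm.

```python
import numpy as np

def doc_vector(tokens, word_vectors, dim=100):
    # Represent a document as the mean of its known word embeddings;
    # fall back to a zero vector if no token has an embedding.
    vecs = [word_vectors[w] for w in tokens if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def oversample(minority, n_new, seed=0):
    # SMOTE-style oversampling: each synthetic sample lies on the
    # segment between two real minority-class document vectors.
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i, j = rng.choice(len(minority), size=2, replace=False)
        lam = rng.random()
        out.append((1 - lam) * minority[i] + lam * minority[j])
    return np.stack(out)
```

Interpolating in a dense embedding space rather than in a sparse bag-of-words space avoids creating synthetic samples with implausible word-count combinations, which is one plausible motivation for an embedding-based variant.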
When Frequency Data Meet Dispersion Data in the Extraction of Multi-word Units from a Corpus: A Study of Trigrams in Chinese
One of the main approaches to extracting multi-word units is the frequency threshold approach, but the way this approach handles dispersion data leaves much to be desired. This study adopts Gries's (2008) dispersion measure to extract trigrams from a Chinese corpus, and the results are compared with those of the frequency threshold approach. It is found that the overlap between the two ap...
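Assuming the measure in question is Gries's (2008) DP ("deviation of proportions"), a minimal sketch of how such a dispersion score could be computed for a trigram across corpus parts follows; the function and variable names are mine, not the study's.

```python
def gries_dp(freq_in_parts, part_sizes):
    # DP is half the sum of absolute differences between each part's
    # share of the item's occurrences and that part's share of the
    # corpus: 0 means perfectly proportional dispersion, values near 1
    # mean the item is concentrated in very few parts.
    total_freq = sum(freq_in_parts)
    total_size = sum(part_sizes)
    return 0.5 * sum(
        abs(f / total_freq - s / total_size)
        for f, s in zip(freq_in_parts, part_sizes)
    )

# A trigram occurring 50 times, all inside one of five equal-sized
# parts, disperses far worse than one spread evenly across them.
print(gries_dp([50, 0, 0, 0, 0], [1000] * 5))      # 0.8
print(gries_dp([10, 10, 10, 10, 10], [1000] * 5))  # 0.0
```

A dispersion-aware extractor can then filter or down-weight trigrams with high DP even when their raw frequency clears the threshold.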
Global Syllable Vectors for Building TTS Front-End with Deep Learning
Recent vector space representations of words have succeeded in capturing syntactic and semantic regularities. In the context of text-to-speech (TTS) synthesis, a front-end is a key component for extracting multi-level linguistic features from text, where the syllable acts as a link between low- and high-level features. This paper describes the use of global syllable vectors as features to build a fro...